Project Work

Document Classification and Analysis on Resumes Data

Document classification to reduce human effort in the HRM.

(Feature Engineering + Exploratory Analysis + Topic Modelling + Bag of Words + Machine Learning Models)

Contents of the Notebook:

1. NLP Understanding
2. Business Understanding
   • Business Problem
3. Dataset Preparation
   • Import Libraries
     o For extraction of text from different files and making a DataFrame
     o For Model Building and Text Cleaning
4. Data Extraction and DataFrame Preparation
5. Data Understanding
   • Loading Data
   • Statistical Summaries
6. Feature Engineering
   • Statistical Features
   • Text Features
7. Exploratory Data Analysis (EDA)
   • Resume Count Distribution over the Labels
   • Resume Label Count (Pie Chart)
   • Resume Label Count as Percentage (Pie Chart)
   • Resume Character Count Distribution
   • Resume Word Count Distribution
   • Resume Average Word Length Distribution
   • Top 20 Words before Removing Stop Words
   • Most Common Word Frequency
   • N-gram Analysis (before cleaning)
     o Top Bigram Frequency
     o Top Trigram Frequency
   • WordCloud
   • Word Count Distribution under Each Label
   • Character Count Distribution under Each Label
8. Text Pre-Processing
9. EDA (After Text Pre-Processing)
   • Sentiment Analysis
   • N-gram Analysis (after cleaning)
     o Top 20 Words after Cleaning
     o Top Bigrams after Cleaning
     o Top Trigrams after Cleaning
   • Top Parts of Speech
   • Sentiment Boxplot under Each Label
   • Resume Length Boxplot under Each Label
   • Topic Model Analysis
   • Correlation between Variables
   • Named Entity Recognition (NER)
10. Extracting Vectors from Text (Vectorization)
    • CountVectorizer
    • Term Frequency–Inverse Document Frequency (TF-IDF)
    • Word2Vec
    • GloVe
11. Data Splitting – Train/Test Split
12. Building ML Models for Text Classification
    • Logistic Regression
    • Support Vector Machine
    • XGB Classifier
    • Grid Search CV – Naive Bayes
    • Naive Bayes
    • Deep Neural Network – LSTM with GloVe
    • Deep Neural Network – Simple Dense Network
    • KNN Classifier
    • Random Forest Classifier
    • Decision Tree Classifier
13. Prediction with All ML Models Using Different Vectorization Techniques
14. Deep Neural Network
    • Simple Dense Network on the GloVe Features
    • LSTM on the GloVe Features
15. Evaluation of the Models: Classification Report and Log Loss
16. Finalizing Model Choice
17. Deployment
18. Conclusion & Recommendation




NLP Understanding

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.

NLP is currently the focus of significant interest in the machine learning community. Some of the use cases for NLP are listed here:

  1. Chatbots
  2. Search (text and audio)
  3. Text Classification
  4. Sentiment Analysis
  5. Recommendation Systems
  6. Question Answering
  7. Speech Recognition
  8. NLU (Natural Language Understanding)
  9. NLG (Natural Language Generation)

A lot of the data you might analyze is unstructured and contains human-readable text. Before you can analyze that data programmatically, you first need to preprocess it. In this notebook we look at the kinds of text preprocessing tasks that can be done with NLTK, along with some basic text analysis and visualizations.

What is Text Classification?

Text classification is one of the important tasks in supervised machine learning (ML). It is the process of assigning tags/categories to documents, helping us automatically and quickly structure and analyze text in a cost-effective manner. It is one of the fundamental tasks in Natural Language Processing, with broad applications such as sentiment analysis, spam detection, topic labelling and intent detection.
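As a concrete (toy) illustration of assigning categories to documents, a minimal scikit-learn pipeline; the texts and labels below are illustrative stand-ins, not the project's data:

```python
# Tiny text-classification sketch: vectorize with bag-of-words, then
# fit a Naive Bayes classifier, then predict the label of unseen text.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["peoplesoft payroll hrms", "react component javascript",
         "peoplesoft module hrms", "react js frontend"]
labels = ["Peoplesoft", "React JS", "Peoplesoft", "React JS"]

clf = make_pipeline(CountVectorizer(), MultinomialNB()).fit(texts, labels)
pred = clf.predict(["experienced react js developer"])
# pred[0] -> "React JS": the distinctive tokens 'react' and 'js' only
# appear in the React JS training documents.
```

The same fit/predict pattern scales to the real resume corpus once the text has been extracted and cleaned.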

Business Understanding

Business Problem: skill classification based on the resumes provided

We will build a classifier to predict a person's skill category based on the description in their resume.

Business Objective

Company Problem: To classify resumes in order to reduce manual human effort in the HRM and finance departments.

Objective: The document classification solution should significantly reduce manual human effort in the HRM and finance departments. It should achieve a high level of accuracy and automation with minimal human intervention.




Import Libraries

For DataFrame preparation and text extraction by converting .doc, .docx, .pdf and .ppt files.

For Model Building and Text Cleaning

Data Extraction

Text extraction from .doc, .docx, .pdf and .ppt files

*Note: Run this section only when the files still need to be converted; otherwise, when working on the prepared dataset, start from the Data Understanding section.

Extracting text from all resumes in the Peoplesoft resumes folder and creating a .csv file

  1. Converting .doc files and saving them as .docx files
  2. Extracting text from all .pdf, .docx and .ppt files and saving it into a .csv file

Extracting text from all resumes in the SQL Developer Lightning insight folder and creating a .csv file

  1. Converting .doc files and saving them as .docx files
  2. Extracting text from all .pdf, .docx and .ppt files and saving it into a .csv file

Extracting text from all resumes in the workday resumes folder and creating a .csv file

  1. Converting .doc files and saving them as .docx files
  2. Extracting text from all .pdf, .docx and .ppt files and saving it into a .csv file

Extracting text from all resumes in the React JS folder and creating a .csv file

  1. Converting .doc files and saving them as .docx files
  2. Extracting text from all .pdf, .docx and .ppt files and saving it into a .csv file

Extracting text from all resumes in the Internship folder and creating a .csv file

  1. Converting .doc files and saving them as .docx files
  2. Extracting text from all .pdf, .docx and .ppt files and saving it into a .csv file
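The per-folder steps above can be sketched as one reusable helper. This is a hedged sketch, not the notebook's exact code: python-docx and PyPDF2 are assumed as the extraction libraries, the prior .doc-to-.docx conversion is taken as already done, and the folder path, label and column names are illustrative.

```python
# Walk a resume folder, pull plain text from each supported file, and
# write one (label, filename, text) row per resume to a CSV file.
import csv
from pathlib import Path

def extract_text(path: Path) -> str:
    """Dispatch on file extension and return the resume's plain text."""
    if path.suffix == ".docx":
        import docx                       # python-docx (assumed installed)
        return "\n".join(p.text for p in docx.Document(path).paragraphs)
    if path.suffix == ".pdf":
        import PyPDF2                     # assumed installed
        reader = PyPDF2.PdfReader(str(path))
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    raise ValueError(f"unsupported file type: {path.suffix}")

def folder_to_csv(folder: str, label: str, out_csv: str) -> int:
    """Write one row per resume; return how many resumes were written."""
    rows = [(label, p.name, extract_text(p))
            for p in sorted(Path(folder).glob("*"))
            if p.suffix in {".docx", ".pdf"}]
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["label", "file", "text"])
        writer.writerows(rows)
    return len(rows)
```

Calling `folder_to_csv("Peoplesoft resumes", "Peoplesoft resumes", "peoplesoft.csv")` (paths hypothetical) would produce one of the five per-folder CSV files that are later concatenated into a single DataFrame.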

Data Understanding

Loading Data into one dataframe

Statistical summaries

VARIABLE DESCRIPTIONS:

  1. We've got a sense of our variables, their class types, and the first few observations of each. We are working with 78 resumes and 5 labels.
  2. The 5 labels are 'Peoplesoft resumes', 'SQL Developer Lightning insight', 'workday resumes', 'React JS' and 'Internship'.
  3. The label counts are:

| Label | Count |
| :--- | ---: |
| Peoplesoft resumes | 20 |
| SQL Developer Lightning insight | 14 |
| workday resumes | 20 |
| React JS | 22 |
| Internship | 2 |

  4. Combining all resumes, we have 104,963 words.

  5. There are no null values in the dataset.

Next, have a look at some key information about the variables.

In total there are 78 resumes and 5 unique labels.

Exploratory Data Analysis (EDA)

Resume Count Distribution over the Label

Inferences: Class distribution. There are more React JS resumes than any other label, so the dataset is slightly imbalanced. We will apply a data-balancing technique such as SMOTE while building the model.

Resume Label Count (Pie Chart)

Resume Label Count on Percentage (Pie Chart)

Inference: The pie chart shows the percentage of the dataset under each category.

Resume Character Count Distribution

Inference: The histogram shows the number of characters in each resume: lengths range from 2k to 18k characters, and most fall between 2.5k and 8k.

Resume Word Count Distribution

Inferences: Data exploration at word level. The number of words per resume ranges from 400 to 2,500 and mostly falls between 500 and 700.

Resume Average Word Length Distribution

Inference: The average word length ranges between 3.5 and 6.5, with 6 being the most common.
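The character-count, word-count and average-word-length features behind the three distributions above reduce to simple per-resume statistics, e.g.:

```python
# Per-resume statistical features (whitespace tokenization assumed,
# matching the pre-cleaning EDA above).
def resume_stats(text):
    """Return (character count, word count, average word length)."""
    words = text.split()
    char_count = len(text)
    word_count = len(words)
    avg_word_len = (sum(len(w) for w in words) / word_count) if words else 0.0
    return char_count, word_count, avg_word_len

# Example: "sql server dba" has 14 characters, 3 words, mean word length 4.0
stats = resume_stats("sql server dba")
```

Applying `resume_stats` over the text column gives the three columns whose histograms are plotted above.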

Resume Top 20 words before removing stop words

Inference: We can evidently see that stopwords such as "and", "the" and "in" dominate the resumes.

Resume Most Common Word Frequency

Inference: From the above we can see which words occur most frequently. At word level, 'PeopleSoft' occurs most frequently in the resumes, followed by 'Workday' and 'SQL', with 'ReactJS' the least frequent.

EDA: N-gram Analysis (before cleaning)

Resume Top Bigram Words Frequency

Inference: We can see that 'SQL Server' and 'React JS' frequently occur together as bigrams. The most common bigram is 'experience in', which suggests that most of the resumes received are from experienced candidates.

Resume Top Trigrams Words Frequency

Inference: We can see that many of these trigrams are combinations involving the word 'experience'.
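The bigram/trigram counts above can be reproduced with a small helper (whitespace tokenization assumed, as in the pre-cleaning analysis):

```python
# Count word n-grams across a list of documents and return the most common.
from collections import Counter

def top_ngrams(texts, n=2, k=5):
    """Return the k most frequent n-grams as ((w1, ..., wn), count) pairs."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts.most_common(k)
```

With `n=2` this yields the bigram chart and with `n=3` the trigram chart; in practice the same result comes from sklearn's `CountVectorizer(ngram_range=(2, 2))`.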

EDA: Wordcloud

Inference: At a glance we can see the most relevant words in our data, which are highly indicative for classifying the resumes. We can also see that the internship resumes do not contain any software skills.

EDA: Word count & Character count under each category

Resume Word Count Distribution Under Label

Inference: Visualizing the word count, we can clearly see the number of words each resume contains; Peoplesoft resumes have the highest counts of all.

Resume Character Count Distribution Under Label

Inference: Visualizing the character count, we can see the average number of characters under each label. Peoplesoft resumes average 7402.05 characters, higher than the averages of the other resume categories.

Text Pre-Processing

EDA (After Text Pre-Processing)

EDA: Sentiment Analysis

Inference: The vast majority of the sentiment polarity scores are greater than zero, meaning most of the resumes are fairly positive.

EDA: Length of Resume

Inference: The resume length ranges between 2k and 18k characters, with the most common length between 3k and 5k.

EDA: N-gram Analysis (after cleaning)

Top 20 Words in Resumes after Cleaning the Data

Top 10 Resume Bigrams after Cleaning the Data

Top 20 Resume Trigrams after Cleaning the Data

EDA: Top Part of Speech

EDA: Sentiment boxplot under different label resume

Inference: The highest sentiment polarity score was achieved by the Internship category; the other four categories mostly lie between 0 and 0.2. The workday resumes have the lowest median polarity score.

EDA: Resume length boxplot under different label resume

Inference: The median resume lengths of the React JS and Internship categories are relatively lower than those of the other categories.

EDA: Topic Model Analysis

Topic modeling is the process of using unsupervised learning techniques to extract the main topics that occur in a collection of documents.

Inference:

  1. On the left side, the area of each circle represents the importance of that topic relative to the corpus. As there are five topics, there are five circles.
  2. The distance between the centres of the circles indicates the similarity between topics. Here topics 5, 4 and 3 ('React JS', 'Internship', 'SQL server') overlap, indicating that these topics are more similar.
  3. On the right side, the histogram for each topic shows its top 30 relevant words. For example, in topic 1 ('Workday') the most relevant words are Workday, integration, HCM, business, etc.

So, in our case, we can see a lot of words and topics associated with PeopleSoft and Workday.
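A hedged sketch of the topic-modelling step using scikit-learn's `LatentDirichletAllocation` (the notebook's own implementation may differ, e.g. gensim with pyLDAvis for the circle plot described above; the five topics mirror the five labels, and the documents below are toy stand-ins):

```python
# Fit LDA on a bag-of-words matrix; each row of components_ is one topic's
# word-weight vector, from which the "top 30 relevant words" are read off.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "peoplesoft hrms payroll module",
    "workday hcm integration business process",
    "react js javascript frontend component",
    "sql server query optimization index",
    "internship training project academic",
]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=5, random_state=42).fit(counts)
# lda.components_ has shape (n_topics, vocabulary_size)
```

Sorting each row of `lda.components_` and mapping indices back through `vectorizer.get_feature_names_out()` gives the per-topic word lists.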

EDA: Correlation between variables

Inference: It shows the relationships between the different variables; the sentiment (polarity) feature is negatively correlated with the other variables.

EDA: Named entity recognition

Named entity recognition is an information extraction method in which the entities present in the text are classified into predefined entity types like "Person", "Place", "Organization", etc. Using NER we can get great insights into the types of entities present in a given text dataset.

Inference: We have used spaCy to derive the NER. We can see that the model is far from perfect at classifying: it cannot exactly derive the skills from a resume. We can, however, extract some information, such as the organisation the person worked for, the country where they worked, and the languages they know. For skills we would need to explore further.

Final Inference from EDA

We discussed and implemented various exploratory data analysis methods for text data.

Model Building

Extracting vectors from text (Vectorization)

The process of converting text data into numerical data/vectors is called vectorization or, in the NLP world, word embedding. Bag-of-Words (BoW) and word embeddings (with Word2Vec and GloVe) are two well-known methods for converting text data to numerical data.

Bag of Words

  1. Count vectors: builds a vocabulary from a corpus of documents and counts how many times each word appears in each document.
  2. Term Frequency–Inverse Document Frequency (TF-IDF): TF-IDF increases a word's weight proportionally to the number of times it appears in a document, counterbalanced by the number of documents in which it appears. Term frequency is the number of times a word i appears in a document j divided by the total number of words in the document. Inverse document frequency is the log of the total number of documents divided by the number of documents that contain the word. TF-IDF is computed by multiplying the term frequency by the inverse document frequency.
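The TF-IDF definition above can be written out directly. Note this is the plain `tf * log(N/df)` form as stated here; scikit-learn's `TfidfVectorizer` uses a smoothed variant, so the numbers differ slightly in practice.

```python
# TF-IDF per the definitions above: tf(i, j) * log(N / df(i)).
import math
from collections import Counter

def tfidf(docs):
    """Return, per document, a dict mapping each word to its TF-IDF score."""
    N = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for words in tokenized:
        df.update(set(words))                 # document frequency per word
    scores = []
    for words in tokenized:
        tf = Counter(words)
        scores.append({w: (c / len(words)) * math.log(N / df[w])
                       for w, c in tf.items()})
    return scores

scores = tfidf(["sql sql server", "sql react"])
# 'sql' appears in every document, so its idf (hence score) is 0;
# 'server' and 'react' are distinctive and score higher.
```

This illustrates why TF-IDF down-weights corpus-wide filler words while keeping label-specific terms like 'server' or 'react' prominent.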

Word2Vec

One of the major drawbacks of Bag-of-Words techniques is that they cannot capture the meaning of, or relations between, words from the vectors. Word2Vec is one of the most popular techniques for learning word embeddings using a shallow neural network; it can capture the context of a word in a document, semantic and syntactic similarity, relations with other words, etc.

We can use any of these approaches to convert our text data to numerical form which will be used to build the classification model.

Glove

GloVe stands for Global Vectors for word representation. It is an unsupervised learning algorithm, developed by researchers at Stanford University, for obtaining vector representations of words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
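A pretrained GloVe file stores one word per line, followed by its vector components. A minimal loader sketch (parsing in-memory lines here; in practice `lines` would be an open file handle on one of the Stanford releases, e.g. the commonly used `glove.6B.100d.txt` — file name assumed):

```python
# Parse GloVe's plain-text format: 'word v1 v2 ...' on each line.
def load_glove(lines):
    """Return a dict mapping each word to its embedding vector."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings

# Toy 3-dimensional vectors standing in for the real 100-d pretrained ones:
vectors = load_glove(["the 0.1 0.2 0.3", "sql 0.4 0.5 0.6"])
```

The resulting dictionary is what the embedding matrix for the dense network and LSTM is later built from.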

Difference between GloVe and Word2Vec

The GloVe model is based on leveraging global word-to-word co-occurrence counts over the entire corpus, whereas Word2Vec leverages co-occurrence within a local context (neighbouring words). In practice, however, both models give similar results for many tasks.

GloVe Features

The tokenizer creates tokens for every word in the data corpus and maps them to indices using a dictionary.

word_index contains the index for each word.

vocab_size represents the total number of words in the data corpus.
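A minimal pure-Python sketch of what the tokenizer step produces. This mirrors the usual Keras `Tokenizer` convention (words indexed from 1 in descending frequency order, with index 0 reserved for padding); the notebook itself presumably uses `keras.preprocessing.text.Tokenizer`, whose `word_index` attribute behaves the same way.

```python
# Build word_index and vocab_size from a corpus, Keras-Tokenizer style.
from collections import Counter

def build_word_index(texts):
    """Index words 1..V by descending frequency; 0 is reserved for padding."""
    counts = Counter(w for t in texts for w in t.lower().split())
    word_index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}
    vocab_size = len(word_index) + 1      # +1 for the reserved 0 index
    return word_index, vocab_size

word_index, vocab_size = build_word_index(["sql sql sql developer developer react"])
# word_index -> {'sql': 1, 'developer': 2, 'react': 3}; vocab_size -> 4
```

`vocab_size` sets the first dimension of the GloVe embedding matrix, and `word_index` maps each word to its row in that matrix.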

Data Splitting- Train Test Split

Vectorization using Bag-of-Words (with TF-IDF and CountVectorizer), Word2Vec and GloVe
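A sketch of the split itself (the feature matrix and labels below are toy stand-ins for the vectorized resumes; `stratify` keeps the label proportions similar in train and test, which matters with only 78 resumes and an imbalanced label like Internship):

```python
# Stratified train/test split on toy features and labels.
from sklearn.model_selection import train_test_split

X = [[i] for i in range(8)]                      # stand-in feature rows
y = ["Peoplesoft", "React JS"] * 4               # stand-in labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)
```

With `test_size=0.25` and eight rows, two rows go to the test set, one per class thanks to stratification.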

Building ML models For Text-classification

  1. Logistic Regression
  2. Support Vector Machine
  3. XGB Classifier
  4. Grid Search CV- Naive Bayes
  5. Naive Bayes
  6. Deep Neural Network- LSTM with Glove
  7. Deep Neural Network- Simple Dense Network
  8. KNN Classifier
  9. Random Forest Classifier
  10. Decision Tree Classifier
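The model-comparison step can be sketched as a loop over a dictionary of estimators. This is a hedged sketch with scikit-learn stand-ins only (the XGB classifier, SVM, Random Forest and the Keras networks are omitted here but follow the same fit/score pattern; the texts and labels are illustrative):

```python
# Fit several classifiers on the same TF-IDF features and collect scores.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X_text = ["peoplesoft hrms payroll", "workday hcm integration",
          "react js component", "sql server query",
          "peoplesoft module", "workday business process",
          "react frontend", "sql index tuning"]
y = ["Peoplesoft", "workday", "React JS", "SQL",
     "Peoplesoft", "workday", "React JS", "SQL"]

X = TfidfVectorizer().fit_transform(X_text)
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
    "KNN Classifier": KNeighborsClassifier(n_neighbors=3),
    "Decision Tree Classifier": DecisionTreeClassifier(random_state=42),
}
# Training-set accuracy only, for illustration; the notebook evaluates
# on the held-out test split instead.
scores = {name: model.fit(X, y).score(X, y) for name, model in models.items()}
```

Swapping the `TfidfVectorizer` for `CountVectorizer`, Word2Vec or GloVe features reproduces the per-vectorizer comparison tables below.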

Prediction with all ML models using different vectorization techniques

1. CountVectorizer

Prediction with all ML models using different vectorization techniques

2. Tf-Idf

Prediction with all ML models using different vectorization techniques

3. Word2Vec

Prediction with all ML models using different vectorization techniques

4. GloVe

Deep Learning – Deep Neural Network

1. Simple dense network on the GloVe features

Deep Neural Network

2. LSTM on the GloVe features

Evaluation of the Models: Classification Report and Log Loss

Let's understand the evaluation metrics first:

1. Precision: the number of true positives (TP) divided by the total of true positives and false positives (FP). Of all the positive predictions, how many are truly positive?
2. Recall: the number of true positives (TP) divided by the total of true positives and false negatives (FN). Of all the actual positive examples, how many did we correctly predict to be positive? The formulas for precision and recall differ only in the second term of the denominator: false positives for precision, false negatives for recall.
3. F1 Score: a helpful metric that considers both precision and recall, and can likewise be expressed in terms of TP, FP and FN. Use the F1 score as an average of precision and recall, especially when working with imbalanced datasets. If either recall or precision is 0, the F1 score will also be 0.
4. Macro Average: computes the average without considering class proportions; it treats all classes equally regardless of their support values.
5. Weighted Average: the weighted-average F1 score is the mean of all per-class F1 scores, weighted by each class's support.
6. Log Loss: the most important probability-based classification metric, and a good metric for comparing models. A lower log-loss value means better predictions.
7. Mean Absolute Error (MAE): the average absolute error between the model's predictions and the actual values. The lower the MAE, the more closely the model predicts the actual observations.
8. Root Mean Squared Error (RMSE): likewise, the lower the RMSE, the more closely the model predicts the actual observations.
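The precision, recall, F1 and log-loss definitions above can be written out in plain Python (binary log loss is shown for simplicity; the notebook's multi-class case is the same idea summed over classes):

```python
import math

def precision_recall_f1(tp, fp, fn):
    """Precision = TP/(TP+FP), recall = TP/(TP+FN), F1 = their harmonic mean."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def log_loss_binary(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood of the true class (lower is better)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)        # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)
```

In the notebook these values come from `sklearn.metrics.classification_report` and `sklearn.metrics.log_loss`; the hand-rolled versions are only to make the formulas concrete.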

Understanding the model classification report, log loss, MAE and RMSE from the table above

The table above shows the type of vectorization used with each machine learning model and its metrics: precision, recall, f1-score, accuracy and log loss. We report the macro-average metrics: since our dataset is imbalanced and all classes are equally important, the macro average is a good choice as it treats all classes equally. Since the dataset is very small, the high accuracy suggests overfitting; to handle this, LOOCV should be performed.

From all the model metrics above, we found that the following models gave good results:

| Vectorizer | Machine Learning Model | precision | recall | f1-score | MAE | RMSE | accuracy | log-loss |
| :--- | :--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| CountVectorizer | XGB Classifier | 1.000 | 1.000 | 1.000 | 0.019 | 0.136 | 1.000 | 0.112 |
| Tf-Idf | XGB Classifier | 0.980 | 0.960 | 0.970 | 0.056 | 0.236 | 0.960 | 0.129 |
| Tf-Idf | Random Forest Classifier | 1.000 | 1.000 | 1.000 | 0.074 | 0.272 | 0.960 | 0.393 |
| CountVectorizer | Logistic Regression | 1.000 | 1.000 | 1.000 | 0.093 | 0.304 | 1.000 | 0.007 |
| GloVe | Deep Neural Network – Simple Dense Network | - | - | - | - | - | 0.958 | 0.113 |

So we conclude that the XGB Classifier with Tf-Idf vectorization gives us the lowest MAE with good accuracy and a low log-loss, and this model is not overfitting the data.

That said, we cannot be sure this model is the best, as our dataset is small; including more data would give better results.

Finalizing Model Choice

Full Training

Deployment

Saving the model (and any other required artifacts) as a pickle file

Conclusion & Recommendation

We have walked through a complete end-to-end machine learning project using the employee resume files. We started by converting the .doc files to .docx, then extracted the text from the .docx and .pdf files and stored the data as a .csv file to build a DataFrame. We then implemented various exploratory data analysis methods for text data, through which we looked at sentiment analysis, the top words used in the documents via word clouds, the information contained in the resumes via named entity recognition, and finally topic model analysis, which showed the similarity between topics and the relevant words under each. We covered the basics of building a text classification model, comparing Bag-of-Words (with Tf-Idf and CountVectorizer) and word embeddings (with Word2Vec and GloVe). Finally, we trained a variety of classifiers and dense neural networks and evaluated them with the classification report, log loss, MAE, RMSE and a confusion matrix heatmap.

  1. Both the models and our data analysis classified the resumes correctly.
  2. As our dataset contains few resumes, the classification may sometimes be incorrect; this can be reduced by training the model with more data.